Consumer-friendly EEG-based Emotion Recognition System: 
A Multi-scale Convolutional Neural Network Approach 


Abstract 

Electroencephalography (EEG) is a non-invasive, safe, and low-risk method to record electrophysiological signals
inside the brain. Especially with recent technology developments like dry electrodes, 
consumer-grade EEG devices, and rapid advances in machine learning, EEG is com- 
monly used as a resource for automatic emotion recognition. With the aim to develop 
a deep learning model that can perform EEG-based emotion recognition in a real-life 
context, we propose a novel approach to utilize multi-scale convolutional neural net- 
works to accomplish such tasks. By implementing feature extraction kernels with many 
ratio coefficients as well as a new type of kernel that learns key information from four 
separate areas of the brain, our model consistently outperforms the state-of-the-art 
TSception model in predicting valence, arousal, and dominance scores across many 
performance evaluation metrics. 


1. Introduction 

Emotion is a fundamental part of human life. Associated with feelings, emotions 
are important to one’s decision-making, learning process, and many other cognitive processes [1].
They are also a key tool for human interaction and communication, serving as
a way for communicators to express themselves and provide information on their state
of mind, feelings, motives, and intentions [2]. Therefore, the interest in researching,
learning, and further understanding human emotion and its impact has been growing, 
especially in the field of neuroscience [3]. While much progress is still to be made with 
research regarding the matter due to the complexity behind human emotion [4], many 
advancements have been achieved thanks to the development of technology in the field, 
contributing to novel approaches that researchers can use to further study human emotion.
Emotion is key to perceiving a person’s feelings, expressions, and cognitive state, so a
better understanding of it can be used in a number of ways to improve the quality of life,
one of which is therapy.


In recent years, a number of technological advancements have led to an increased 
interest in a resource that can be used for the task: electroencephalography (EEG), one 
of the most widely used brain imaging technologies. EEG presents a non-invasive way to 
measure brain electrical activities, which can then be passed through a brain-computer 
interface (BCI) to further process the information and identify human emotion. This 
surge in interest is due to the development of consumer-grade EEG devices with dry electrodes.
Before this development, despite their potential, applications of EEG-BCI systems
outside of research labs were extremely sparse due to the limitations of research-grade
EEG devices: (1) setting up a research-grade EEG device is time-consuming, typically taking
30 to 60 minutes, (2) the user’s mobility is heavily restricted by the high number
of wires, and (3) the devices are extremely expensive [5]. Even though data recorded by
research-grade EEG devices may provide more information and allow EEG-BCI systems 
to yield better results in the task of emotion recognition, consumer-grade EEG devices 
open up countless more research and consumer applications with their affordability, por- 
tability and simplicity, while still providing reliable results [6]. 

For automatic emotion recognition using EEG signals, machine learning (ML) algorithms,
specifically deep learning (DL), are among the most popular and reliable methods because
of their known ability to learn non-linear patterns from complex information, a property
found in EEG. Thus, we developed a DL model inspired by
TSception [7] that predicts human emotion based on EEG with reliable accuracy. Our
method offers a novel way of applying a DL architecture to this task.


The remainder of this paper is organized as follows. In Section 2, we give background
information on emotion, EEG, EEG-based emotion recognition, and deep learning.
Section 3 provides the details of our materials and methods by introducing the dataset
and the performance evaluation metrics that we used, the experimental setup of the DL
model, and the mechanism behind our model. In Section 4, we present the results of our
model and analyze how these results can be interpreted. Future implications, suggestions,
and a discussion regarding our study are given in Section 5. Finally, the conclusion and
guidance acknowledgement are presented in Sections 6 and 7.
2. Background
1) Emotion 

Emotion has long been a complicated subject of research. One of the earliest at- 
tempts to explore the mechanism behind human emotions comes from the James-Lange 
theory of emotion in 1884 [8], in which psychologist William James and psychologist 
Carl Lange suggested that emotion is the result of a physiological response to external 
stimuli (e.g., your body trembling and your heart beating rapidly would cause you to feel 
fear). This theory was challenged in 1927 by Walter Cannon and Philip Bard, who argued
in the Cannon-Bard theory that emotion and physiological response occur simultaneously,
not sequentially [9]. The Freudian theory by Sigmund Freud, one of the greatest
and most well-known theories in psychology, introduced the idea that much of human 
behavior, including human emotion, is greatly influenced by the unconscious mind. In 
the 1950s, the cognitive revolution led to the development of novel theories such as the 
Schachter-Singer Two-Factor Theory of Emotion, which suggests that emotion consists 
of two components: physical arousal and a cognitive interpretation [10]. In the 1960s and 
the 1970s, Richard Lazarus pioneered the advancement of the cognitive appraisal theory, 
stating that emotion is the response to an individual’s evaluation of a situation. As the 21st
century approached, advances in brain imaging such as functional magnetic resonance imaging
led to many studies about the brain’s role in human emotions, the emergence of the
field of affective neuroscience, and the popularization of emotional intelligence. These 
advancements, with pioneers like Jaak Panksepp and Daniel Goleman, have led to rapid 
progress in extensive research on the role of emotion and the mechanism behind it. 


Emotion plays many roles in human life: (1) Emotions serve as social signals in 
human exchanges, conveying one’s feelings, perceptions, and situational understanding 
to the interaction partner [11]; (2) Emotion greatly influences cognitive function, which 
includes decision making, perception, learning, and maintaining health [1]; (3) Emotion
significantly affects the course of actions, execution, control, and the way one explains 
their actions [12]. It also aids our daily lives in various ways. Most evidently, emotion 
is known to provide memory benefits, though it enhances memory more for negative 
experiences than positive ones [13]. As a social signal, emotion serves as a tool for humans to
connect with others and navigate social interactions so that we can resonate with and care for
another person [14]. The positive or negative nature of emotions can also affect human
health, as well as how people influence and experience work processes [15], [16].


There is no unitary definition for emotion [17], and the way emotion is perceived 
has vastly varied throughout modern science. The way people feel certain emotions can 
be different from how others do, and the description of an emotion one feels can vastly 
vary from another [18]. P. R. Kleinginna and A. M. Kleinginna [19] famously defined 
emotions as “a complex set of interactions among subjective and objective factors, mediated by 
neural-hormonal systems, which can (a) give rise to affective experiences such as feelings of arousal,
pleasure/displeasure; (b) generate cognitive processes such as emotionally relevant perceptual
effects, appraisals, labeling processes; (c) activate widespread physiological adjustments to the
arousing conditions; and (d) lead to behavior that is often, but not always, expressive, goal-directed, 
and adaptive.” Generally, emotion is described as the brain’s consistent response toward an 
external stimulus. To classify emotion, most approaches are either categorical or dimen- 
sional. The categorical approach classifies emotions into a group of defined emotional 
states, or discrete emotions. Many researchers have presented various ways to define such 
groups: 


1) Kemper [20] stated that there are four primary emotions - fear, anger, depres- 
sion, and satisfaction - and many secondary emotions that can be acquired via 
social agents like guilt, shame, pride, gratitude, love, nostalgia, and ennui. 

2) Levenson [21] stated that basic emotions need to meet three criteria of dis- 
tinctness, hard-wiredness, and functionality, and he found six emotions that 
suffice: enjoyment, anger, disgust, fear, surprise, and sadness. 

3) Ekman and Cordaro [22] stated that most basic emotions share 13 common 
characteristics and summarized seven basic emotions: anger, fear, surprise,
sadness, disgust, contempt, and happiness. 

4) Plutchik [23] proposed eight basic emotions described in a wheel model: joy, 
trust, fear, surprise, sadness, disgust, anger, and anticipation. 

5) Izard [24] proposed 10 basic emotions: interest, joy, surprise, sadness, fear, 
shyness, guilt, anger, disgust, and contempt. 


In contrast, the dimensional approach better captures the complexity of emotion
by using scales for different affective states. While this approach still cannot
depict the full complexity and high-dimensional nature of emotions [25], it
allows much easier implementation and interpretation and improves accuracy for
emotion classification systems. One of the most commonly used dimensional models in
emotion classification systems is Russell’s circumplex model of affect, shown in Fig. 1 [26].
It is a two-dimensional model that uses two scales - a valence scale and an arousal scale.
The valence scale shows how positive or negative a person is feeling, and the arousal scale
shows how high or low a person’s attention level is. Russell’s and Steiger’s extended version
of this model, often referred to as the valence-arousal-dominance (VAD) model [27],
is just as commonly used; it adds an extra dimension called dominance to show
how in control or submissive a person is feeling, as shown in Fig. 2.
Fig. 1. Russell’s circumplex model of affect.

Fig. 2. Russell’s and Steiger’s VAD model [28].


2) EEG 

EEG is the electrophysiological process of measuring and collecting information 
on electrical fields in the brain by placing electrodes on the scalp [29], providing a dis- 
play of electrical activities in the brain in the form of waves with varying frequencies, 
amplitudes, and shapes. These fields are formed when billions of electrochemical signals 
spontaneously pass between brain neurons in an extended space, causing the sum of these 
fields to be powerful enough to be recorded from outside, making EEG a completely safe 
and non-invasive procedure. In other words, EEG signals are the result of impulses of 
brain neurons. EEG does not record the activities of every individual neuron but rather 
captures the signals created when neurons activate at the same time. One major limita- 
tion of EEG is that its signals, which come from cerebral activity, can be contaminated 
easily by other physiological signals from other electrical activities generated by the body 
during common occasions such as movements or eye blinks, causing EEG signals to be 
non-linear, non-stationary, and often overwhelmed by noise [30], [31]. Raw EEG data is 
known to be extremely complex and challenging to interpret, requiring advanced analysis,
signal processing, and feature extraction to be correctly interpreted [32]. An EEG feature
is a pattern associated with a particular sensory or cognitive process [29], which can capture
human states of mind and emotional states. Thus, EEG, especially when analyzed in
combination with EEG features, has been shown to provide useful information that
reflects characteristics of emotional states, making it a very popular
choice for the task of emotion recognition [33], [34].


EEG electrodes record the signals of a small area surrounding them. An EEG head- 
set can have as few as four electrodes and as many as 256 electrodes. Electrodes are often 
distributed all across the scalp to cover as much area as possible. Strategically placing these 
electrodes to yield the most satisfactory recording is one of the most important aspects of
any EEG test, study, or lab experiment. One standardized technique to place these electrodes
across the scalp is to follow the International 10-20 system, depicted in Fig. 3 [35], [36]. 
The International 10-20 system uses four anatomical landmarks on the scalp as references
to ensure consistent placement across EEG research:


1) The nasion: The bridge of the nose. 

2) The inion: The lowest point of the skull at the back of the head. 

3) The left and right preauricular points: The two points located just in front of 
the ears where the upper jaw and the lower jaw meet. 


The numbers 10 and 20 refer to the distances between adjacent electrodes.
Every electrode that is next to an anatomical landmark is placed at 10% of the total
front-to-back or right-to-left distance of the head, and every other electrode is placed
at 20% of that distance. A unique label is assigned to every electrode, each
consisting of one or two letters and one number, representing the region of the brain
where the electrode is placed. The assigned number is even if the electrode is placed over the
right hemisphere of the brain and odd if it is placed over the left. The assigned
letter (or letters) is an abbreviation of a brain region: Fp for pre-frontal, F for
frontal, T for temporal, O for occipital, P for parietal, and C for central.
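
The labeling rule above can be illustrated with a short Python sketch. It only encodes the letter abbreviations and the odd/even convention stated in this section; the function name describe_electrode is a hypothetical helper introduced for illustration, and labels whose letters are not listed here (for example, some 10-10 labels) are simply reported as "unknown".

```python
import re

# Region abbreviations exactly as listed in the text above.
REGIONS = {
    "Fp": "pre-frontal",
    "F": "frontal",
    "T": "temporal",
    "O": "occipital",
    "P": "parietal",
    "C": "central",
}


def describe_electrode(label):
    """Split a label such as 'F3' into (region, hemisphere).

    Hypothetical helper: only handles labels made of one or two
    letters followed by a number, as described in the text.
    """
    match = re.fullmatch(r"([A-Za-z]{1,2})(\d+)", label)
    if match is None:
        raise ValueError(f"not a letters-plus-number label: {label!r}")
    letters, number = match.group(1), int(match.group(2))
    region = REGIONS.get(letters, "unknown")
    # Even numbers mark the right hemisphere, odd numbers the left.
    hemisphere = "right" if number % 2 == 0 else "left"
    return region, hemisphere


print(describe_electrode("F3"))   # ('frontal', 'left')
print(describe_electrode("O2"))   # ('occipital', 'right')
```

Keeping the abbreviations in a lookup table makes it easy to extend the sketch with further region codes if a different montage is used.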



Fig. 3. The International 10-20 electrodes placement system (Source: [37]). 


An extension of the International 10-20 system, the 10-10 system, is also very
commonly used for electrode placement in EEG research. Instead of 10% and 20%
intervals, the 10-10 system only uses 10% intervals. This allows more possible positions
for electrode placement, which can be useful for purpose-specific signal recording
in research.

                10-20 system                         10-10 system
Spacing         10% and 20% intervals                10% intervals
Common usage    Clinical EEG, routine monitoring     Research, studies
Convenience     Easier to set up and interpret       More complex, requires more setup time
Applications    Routine diagnostics, sleep studies   Detailed brain mapping, research studies


Table 1. A comparison of the 10-20 system and the 10-10 system.

Fig. 4. The 10-10 electrode placement system.


Thanks to the rapid development of dry electrodes and consumer-grade EEG devices
from companies such as Emotiv, OpenBCI, and NeuroSky, there has been a massive
boost in EEG and EEG-based emotion recognition research. Dry electrodes yield
performance on a similar level to the more commonly used wet
electrodes (often considered the gold standard for EEG electrodes) [38], [39], [40], [41]
while presenting a number of advantages:


1) Less preparation time: Dry electrodes take significantly less preparation
time, as EEG headsets with wet electrodes require additional steps such as
applying conductive gel [39], [42].

2) Better user comfort: With the use of conductive gel in wet electrodes, partic- 
ipants often experience discomfort, including skin irritation and inconvenient 
scalp preparation, especially when used for an extended time duration [38], 
[40], [43]. 

3) Easy maintenance: Since dry electrodes do not require the use of conductive
gel, they are much easier to clean and require much less maintenance effort.

4) Portability: EEG headsets with dry electrodes are more wearable and portable,
as well as requiring very little scalp preparation [44].

5) Low cost: EEG systems with dry electrodes cost significantly less compared to
systems with wet electrodes [41].


3) EEG-based emotion recognition

Emotion recognition refers to the task of identifying and interpreting human emo- 
tions with the help of various indications such as facial expressions, body language, or 
physiological signals like photoplethysmography, electromyography, and electrocardio- 
gram (ECG). In our case, EEG-based emotion recognition uses EEG, the most commonly 
used option among physiological signals [6], [45], [46]. 

In an EEG-based emotion recognition study, researchers typically need to recruit
participants and select a stimulus to evoke a targeted emotion. During a recording session,
the participant wears an EEG device and is exposed to the stimulus while the voltage
fluctuations in the brain are recorded, giving a picture of the electrical activity in the
brain. The data is then passed through a
preprocessing stage, where it is denoised and artifacts are filtered out. The clean data is
then analyzed and goes through the feature extraction stage - one of the most essential
stages in EEG-based emotion recognition. Only then is an ML classifier trained on
the EEG data, which allows it to predict the emotional state of a person,
producing the final output.


4) Deep learning 

For the task of emotion recognition, artificial intelligence (AI)-based classifiers
using ML algorithms have been widely used, especially in recent years, due to the rapid
development of AI in general and ML specifically. While the terms AI, ML, and DL are
often incorrectly used interchangeably, they are distinct terms
that cover different ranges of algorithms and techniques. AI includes all algorithms and
techniques that allow computers to mimic human behavior [47]. Many AI systems operate
on hard-coded rules, which limits their adaptability and their ability to complete
tasks with a high level of complexity. ML overcomes this by learning and improving
through experience. In more technical terms, ML systems train through many iterations,
compare their predictions to the actual answers to calculate a loss value, and then improve
and adapt by adjusting their parameters to minimize the loss. When the loss or the error
rate stops changing after many iterations, indicating that it has been minimized and that the
model's performance has stopped improving, the model is said to have reached
convergence. Literature tackling emotion recognition using ML often approaches it either
as a classification or a regression problem. A classification task has outputs (often
referred to as labels in ML) that can be divided into a finite number of groups. An example
of this would be predicting whether an email is spam or not, in which the labels can be
divided into two groups - spam or not spam. A regression task has continuous outputs,
specific values that cannot be divided into a finite number of groups. An example
of this would be predicting house prices, in which the output can be any positive decimal
value. There are four types of ML algorithms:


1) Supervised learning: These algorithms train using datasets that have pairs 
of input and label and then later predict an outcome from a completely new 
input. Most ML algorithms fall in this group. 

2) Unsupervised learning: These algorithms train using datasets that only have
inputs and no labels. The algorithm then uses the available data to
complete a task such as association (e.g., detecting a pattern in a user’s activities
in a music app and then using that pattern to suggest a song that the user
would enjoy).

3) Semi-supervised learning: In most supervised learning tasks, it is rare to
find a completely labeled dataset. While the term is not as commonly used, some refer to
tasks where a fraction of the dataset is missing labels as semi-supervised learning.

4) Reinforcement learning: These algorithms help a system learn to make decisions
that optimize its performance through a trial-and-error learning process,
and are commonly applied to game playing. One well-known example of this is Google’s
AlphaGo, an AI that famously defeated Go world champion Lee Sedol in
2016.
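
The training loop described at the start of this subsection (predict, compute a loss, adjust, repeat until convergence) can be sketched in a few lines of Python. This is a minimal illustration on a toy regression problem, not the training procedure of our model; the function name train_linear_model, the learning rate, and the convergence tolerance are all assumptions chosen for the example.

```python
def train_linear_model(xs, ys, lr=0.05, tol=1e-9, max_iters=10_000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error.

    Illustrative sketch: lr, tol, and max_iters are arbitrary choices.
    """
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    n = len(xs)
    for step in range(max_iters):
        # Compare predictions to the actual answers to get a loss value.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        # Treat the model as converged once the loss stops changing.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
        # Adjust the parameters in the direction that reduces the loss.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss, step


# Toy supervised dataset: inputs paired with labels following y = 2x + 1.
w, b, loss, steps = train_linear_model([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 2), round(b, 2))
```

On this toy data the learned parameters approach w = 2 and b = 1, and the loop exits early once successive loss values differ by less than the tolerance.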


DL builds on an ML technique called the artificial neural network (ANN), which
mimics how a neural network in the human brain works, amplifying or reducing
signals transmitted between neurons by increasing or decreasing a weight value
that is assigned to each connection. DL systems often show better learning capabilities
and accuracies compared to traditional ML systems in many applications with
high-dimensional data like text, image, video, and audio data, which is why there has been
massive interest in DL applications.


Artificial neural network 

In the human brain, there are approximately 86 billion neurons forming 100 trillion 
connections to each other. To process the unending flow of information from the body, 
these neurons create electrical impulses non-stop to move and transmit information in 
a neural network. This is the basis and the inspiration for the artificial neural network: 
To complete a complex task, a neuron will pass on information to another neuron [48], 
which allows it to thrive in tasks with large amounts of data that linear ML algorithms 
would not be able to handle. A key characteristic of ANN is its ability to learn and extract 
features from data, helping it make optimal decisions. An ANN has three types of layers: 


1) Input layer: This is the layer where the data enters the ANN. Each neu- 
ron represents a distinct feature in the data. Every ANN only has one 
input layer. 

2) Hidden layer(s): This is where the ‘learning’ process happens in an 
ANN. The depth (the number of hidden layers) and the width (the num- 
ber of neurons in each hidden layer) are hyperparameters that form the 
network architecture, which allows the network to extract and recog- 
nize complex patterns from the data. An ANN can have one or more 
hidden layers. 

3) Output layer: This is the final layer of the network. The output of this 
layer is the output of the whole network. Every ANN only has one out- 
put layer. 

Fig. 5. An example of an artificial neural network, produced using NN-SVG [49].


In an ANN, connections between neurons allow them to process information from 
each other with the use of weights. Each connection has its own weight, and the closer 
the value of the weight is to 0, the weaker the connection is. When an ANN learns via 
training iterations, the ANN adjusts these weights accordingly in order to minimize the 
error. However, for the network to process and learn complex patterns like those from 
real-life applications, this linear learning process would often not be good enough. To add 
a non-linear aspect to neurons’ connections, an ANN uses activation functions, which 
process the output of a neuron to determine whether it should be activated or not. Activation
functions are typically applied in hidden layers, and the activation function used in
each hidden layer can vary, making it another hyperparameter. In a linear ANN model, the
activation function would simply be f(x)=x. For more complex ANN models, other activation
functions such as the Rectified Linear Unit (ReLU), Sigmoid, and Tanh are used. Finally,
a bias might be applied to adjust the output independently of the input data, allowing the 
model to better fit the data when training. 
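
As a minimal sketch of the mechanism described above, the following Python code computes the output of a single neuron as a weighted sum of its inputs plus a bias, passed through an activation function. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
import math


def identity(z):
    """Linear activation f(x) = x."""
    return z


def relu(z):
    """Rectified Linear Unit: passes positive values, zeroes out the rest."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, then the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


inputs, weights, bias = [1.0, 2.0], [0.5, -1.0], 0.5  # weighted sum + bias = -1.0
print(neuron_output(inputs, weights, bias, identity))   # -1.0
print(neuron_output(inputs, weights, bias, relu))       # 0.0
print(neuron_output(inputs, weights, bias, math.tanh))
print(neuron_output(inputs, weights, bias, sigmoid))
```

The same pre-activation value of -1.0 yields different outputs depending on the activation function, which is what makes the choice of activation a hyperparameter.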


Deep learning 

ANN is the backbone of DL. The field of DL focuses on utilizing neural networks 
(NN) in a more complicated way to learn complex patterns in large datasets. DL has con- 
tinuously outperformed traditional ML techniques and achieved state-of-the-art results 
in emotion recognition [50], [51], [52] thanks to its effectiveness in processing auditory 
and visual data. Unlike traditional ML techniques, DL removes the need for manual feature
extraction and feature engineering, as it can automatically learn from raw data after
many iterations (often referred to in ML as epochs). DL models excel when there is a large
amount of data, as their performance generally improves as more data is fed into them,
making them suitable for applications with large datasets. However, a major limitation of
DL models is that training them can be computationally expensive, requiring powerful GPUs,
large amounts of memory, and long training times.


Convolutional neural network 
A convolutional neural network (CNN) is a deep learning technique that specializes
in processing structured, grid-like data such as images. The most fundamental component of
a CNN is the convolutional layer. A convolutional layer has parameters called kernels or
sliding windows, which act as filters that slide over the input data and perform convolution
operations to detect complex patterns in the data. The output of a convolutional layer
is called a feature map. To make CNN models more robust to minor changes and noise,
such as a slight shift or rotation in an image, pooling (subsampling) layers are used
to reduce the width and height of the feature maps, filtering out insignificant details and
retaining the most useful information. This process is often referred to in ML as
downsampling. After the data is processed through many convolutional and pooling layers, the
output is passed to one or several fully connected layers similar to an ANN. The fully
connected layers use the extracted features to produce the final output, which in the example
shown in Fig. 6 would be class labels.
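
To make the sliding-window and pooling operations concrete, the following Python sketch applies a small kernel to a one-dimensional signal and then downsamples the resulting feature map with max pooling. It is a simplified illustration under several assumptions: the data is 1D rather than an image, there is no padding, the stride is 1, and, as in most DL libraries, the kernel is applied without flipping. The function names and the toy signal are hypothetical.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal (no padding, stride 1).

    Each output value is the sum of element-wise products between the
    kernel and the window of the signal it currently covers.
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(feature_map, size):
    """Keep the largest value in each non-overlapping window."""
    return [
        max(feature_map[i:i + size])
        for i in range(0, len(feature_map) - size + 1, size)
    ]


signal = [0, 0, 1, 1, 0, 0, 1]
kernel = [-1, 1]  # responds to rising edges in the signal

feature_map = conv1d(signal, kernel)
pooled = max_pool1d(feature_map, size=2)
print(feature_map)  # [0, 1, 0, -1, 0, 1]
print(pooled)       # [1, 0, 1]
```

The pooled output is half the length of the feature map but still records where the rising edges occurred, which is the downsampling behavior described above.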


[Figure: convolution and pooling layers (feature extraction) followed by fully connected layers (classification)]


Fig. 6. An example of CNN [53] 


3. Materials and Methods 
1) Dataset 

The DREAMER dataset is a commonly used public dataset in the field of affective
computing, especially for emotion recognition based on EEG and ECG. It is a multi-modal
dataset that contains both EEG and ECG signals recorded during affect elicitation using
audio-visual stimuli [54]. The dataset includes recordings from 25 participants (14
males and 11 females) aged between 22 and 33, with an average age of 26.6 years,
though technical problems made the recordings from two subjects (both female)
unsuitable for use. In emotion recognition, audio-visual stimuli (video) elicit greater
valence intensity in subjects than visual-only stimuli (images) [55]. In each session,
which lasted approximately one hour, each subject started by watching a neutral film
clip, which served to record the baseline signals and to return the emotional state of the
subject to normal after watching an emotion-eliciting film clip. Every subject watched 18
film clips targeting nine different emotions (calmness, surprise, amusement, fear,
excitement, disgust, happiness, anger, and sadness), with durations ranging from 65 to 393
seconds (the mean duration was 199 seconds). While the subject was viewing a film clip,
their EEG was recorded at a sampling rate of 128 Hz using the Emotiv EPOC wireless
EEG headset, which has 16 gold-plated contact sensors placed at locations in accordance
with the International 10-10 system. One mastoid sensor was placed at M1 as a ground
reference point to measure the voltage for other sensors, and another mastoid sensor was
placed at M2 as a feed-forward reference for reducing external electrical interference. The
other 14 sensors were placed in the following locations: AF3, F7, F3, FC5, T7, P7, O1,
O2, P8, T8, FC6, F4, F8, and AF4, as shown in Fig. 7. After watching each film clip, the
subject was asked to evaluate their emotion (based on what they actually felt, not what
they thought the targeted emotion of the film clip was) by reporting the felt valence,
arousal, and dominance scores on a five-point scale for each of them. An example
EEG signal plot is shown in Fig. 8, which is the signal plot for one second
of recording taken randomly from the DREAMER dataset. The recorded EEG and ECG
data, the participants’ data, and their valence, arousal, and dominance evaluations were then
stored in a data structure. A summary of the experiment is shown in Table 2.

Fig. 7. All sensor locations used in the DREAMER dataset are colored in red.

Fig. 8. The signal plot for one second of recording taken from the DREAMER dataset.

Number of participants
Number of males
Rating scales             Arousal, Valence, Dominance
Recorded signals          14-channel 128 Hz EEG


Table 2. The experiment summary of the DREAMER dataset [54]. 


Since we are developing an EEG-based emotion recognition system, only the
EEG data of the participants is considered. A threshold of three was used to convert the
arousal, valence, and dominance scores from the 1-5 scale to binary labels: all scores
greater than or equal to three were converted to one, and all other scores
were converted to zero. This is done to minimize the effect of possible variation
between participants in the evaluation of elicited emotion.
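
The thresholding rule described above can be written as a short Python sketch. The function name binarize_score and the example ratings are hypothetical; only the threshold of three and the greater-than-or-equal comparison come from the text.

```python
def binarize_score(score, threshold=3):
    """Map a 1-5 self-assessment score to a binary label.

    Scores greater than or equal to the threshold become 1,
    all other scores become 0.
    """
    return 1 if score >= threshold else 0


# Hypothetical ratings reported by one subject for one film clip.
ratings = {"valence": 4, "arousal": 2, "dominance": 3}
labels = {name: binarize_score(score) for name, score in ratings.items()}
print(labels)  # {'valence': 1, 'arousal': 0, 'dominance': 1}
```

Note that a score of exactly three falls into the positive class under this rule.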


2) Experimental Setup 
Metrics 
To compare our model to the TSception model, we evaluate the performance of
each model using many different metrics: precision, recall, F1 score, accuracy, Matthews
correlation coefficient (MCC), Cohen’s kappa, and area under the receiver operating
characteristic curve (AUROC). All of those metrics can be calculated using the true positive
(TP), true negative (TN), false positive (FP), and false negative (FN) outcomes of the
models, which are defined as follows:
1) TP: Both the prediction and the actual value are positive.
2) TN: Both the prediction and the actual value are negative.
3) FP: The predicted value is positive, but the actual value is negative.
4) FN: The predicted value is negative, but the actual value is positive.
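
As a minimal illustration of these four outcomes, the following Python sketch counts them from a list of actual labels and a list of predictions. The function name confusion_counts and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, and FN for binary labels (1 = positive)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted positive, actually positive
        elif predicted == 0 and actual == 0:
            tn += 1  # predicted negative, actually negative
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, tn, fp, fn


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
```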
Although the most widely applied metrics for evaluating binary classification models are
F1 score and accuracy, we evaluated the models using this wide variety of metrics
because each comes with its own strengths and limitations. Thus, for each potential
application of the model, one metric may provide a better evaluation than the others.
These metrics can be calculated using the following equations:


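(1) $\text{Precision} = \frac{TP}{TP + FP}$

(2) $\text{Recall} = \frac{TP}{TP + FN}$

(3) $\text{F1 score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

(4) $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

(5) $\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

(6) $\text{AUROC} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$

(7) $\text{Cohen's kappa} = \frac{2 \cdot (TP \cdot TN - FN \cdot FP)}{(TP + FP)(FP + TN) + (TP + FN)(FN + TN)}$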
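
As a minimal sketch of how the confusion-matrix formulas for precision, recall, F1 score, accuracy, MCC, and Cohen's kappa translate into code, the function below computes them from the four outcome counts. The function name and the example counts are hypothetical, and the sketch assumes every denominator is non-zero.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Compute metrics from confusion-matrix counts (assumes non-zero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    kappa = 2 * (tp * tn - fn * fp) / (
        (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts for 100 predictions.
m = confusion_metrics(tp=40, tn=30, fp=10, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these example counts, precision is 0.8, accuracy is 0.7, and Cohen's kappa is 0.4, which matches the chance-corrected definition (observed agreement 0.7, expected agreement 0.5).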


Precision is defined as the number of correct positive predictions divided by the
total number of predicted positives. In other words, it evaluates the accuracy of a model’s
positive predictions. Recall is defined as the number of correct positive predictions divided
by the number of actual positive values. In our case, this is the number of correctly predicted
ones divided by the number of ones in the labels. Combining precision and
recall provides a useful evaluation of classification models, which the F1 score achieves by
calculating the harmonic mean of the two. Thus, F1 score remains one of the
most popular choices for evaluating classification models. However, it is worth noting
that F1 score also has a number of disadvantages that prevent it from fully reflecting the
performance of an ML model:


1) The F1 score can be easily affected when dealing with an imbalanced class as it 
can be dominated by the majority class. 

2) As can be seen in equation (3), F1 score does not take into account any true
negative values, which can result in misleading conclusions, especially when
evaluating models on tasks where true negatives matter, such as certain medical
tests or anomaly detection.

3) Precision and recall have equal weights in the calculation of F1 score, assum- 
ing equal importance between them, which results in domain knowledge not 
being taken into account. In many cases, this can lead to significant misleading 
conclusions from the evaluation (e.g., precision might be more important in 
fraud detection, whereas recall might be more important in medical diagnos- 
tics). 

4) The F1 score is sensitive to minor changes in small datasets, which can result 
in instability in the metric. 

5) As the F1 score is the harmonic mean of precision and recall, it is less 
interpretable compared to other metrics like precision, recall, or accu- 
racy. 


Accuracy, like the F1 score, is one of the most popular choices for evaluating
an ML model. It is widely used, especially for models built for
non-complex tasks, because it is simple to calculate and interpret. Accuracy
represents the number of correct predictions divided by the total number
of predictions. This simplicity, however, also brings limitations.
Accuracy is only informative when the classes in the dataset are balanced: in
a dataset with class imbalance, a model can achieve a good accuracy score simply
by predicting the majority class every time, even though it has learned nothing useful.
Moreover, since accuracy only counts correct predictions (true positives and true
negatives) and does not distinguish between false positives and false negatives, the
metric lacks information that is reflected in other metrics like precision
and recall, which might lead to incomplete conclusions.

MCC, ranging from -1 to +1, represents the correlation coefficient between
the observed and predicted binary classifications. Since MCC takes into account
all four categories of the confusion matrix (TP, FP, TN, and FN), it is considered
a balanced metric, particularly useful for imbalanced datasets [56]. An MCC
of +1 indicates perfect predictions, 0 indicates a prediction capability equal to
completely random guessing, and -1 indicates complete disagreement between
predictions and actual outcomes.
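
To illustrate why accuracy and F1 score can mislead on an imbalanced dataset while MCC does not, the sketch below scores a hypothetical classifier that always predicts the majority (positive) class on 90 positive and 10 negative samples. The counts are invented for illustration, and returning 0 when the MCC denominator is zero is a common convention that is assumed here rather than taken from the study.

```python
import math


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Always predicting "positive" on 90 positives and 10 negatives.
tp, tn, fp, fn = 90, 0, 10, 0
acc_value = accuracy(tp, tn, fp, fn)
f1_value = f1_score(tp, fp, fn)
mcc_value = mcc(tp, tn, fp, fn)
print(acc_value, round(f1_value, 3), mcc_value)  # 0.9 0.947 0.0
```

Accuracy and F1 score both look strong even though the classifier has no ability to separate the classes, whereas MCC reports 0.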

AUROC (sometimes simply referred to as AUC) measures the ability of 
a binary classifier to distinguish between positive and negative classes across 
different thresholds. It is calculated by finding the area under the ROC curve, 
which plots the true positive rate (TPR) against the false positive rate (FPR). 
AUROC ranges from 0 to 1, with 1 indicating a perfect classifier, 0.5 indicating 
a completely random classification, and any value <0.5 indicating a worse capa- 
bility than random guessing. 
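
The threshold-free view of AUROC can be sketched through its equivalent ranking interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function below is an illustrative pure-Python version with hypothetical labels and scores; it is not the routine used in the study.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for four samples.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auroc = auroc(example_labels, example_scores)
print(example_auroc)  # 0.75
```

Three of the four positive-negative pairs are ranked correctly, giving 0.75. This pairwise loop is quadratic in the number of samples, so it is meant only to make the definition concrete.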

Cohen's kappa measures the agreement between two raters (in our case, 
the DL model and the labels) while taking into account the possibility of agree- 
ment occurring by chance. This metric is widely used to assess inter-rater reli- 
ability and in classification tasks with more than two classes. The kappa ranges 
from -1 to 1, with +1 indicating perfect agreement beyond chance, 0 indicating 
that agreement is no better than chance, and any value <0 indicating systematic 
disagreement. 


Software, Programming Language, and Libraries 

All data pre-processing, analysis, and evaluation were implemented
using Python 3.10.12 and the TorchEEG library on Google Colab. Developed by Zhang
et al. [57], TorchEEG is the first and one of the most widely used DL toolboxes for
EEG-based emotion recognition, and was developed to aid researchers with EEG and ML
research. TorchEEG separates the DL workflow into five modules: datasets, transforms,
model_selection, models, and trainers, each with its own plug-and-play functionalities for
EEG-based emotion recognition. These modules incorporate state-of-the-art algorithms
in the field, including the TSception model that we used for result comparison, and
provide adaptations of DL models like transformers and diffusion models.
Since the experimental protocol of DREAMER was implemented in the MATLAB
environment [54], the SciPy library was used to load the DREAMER dataset into the
Google Colab notebook. We used the Scikit-learn library to implement several ML
algorithms (logistic regression, random forest, and support vector machine)
that perform the same task on the DREAMER dataset for comparison. Our multi-scale CNN
was implemented using PyTorch Lightning, a popular wrapper for Meta AI’s PyTorch,
and was heavily inspired by TSception [7]. Compared to other commonly used
libraries like Google’s TensorFlow or plain PyTorch, PyTorch Lightning has three strengths:
(1) it minimizes boilerplate code for training loops, (2) it simplifies distributed
GPU training for scalability, and (3) it offers built-in features for model checkpoints,
logging, and experiment tracking. These strengths allow developers and researchers to
focus on the experimental aspects of a DL model, which is why we chose
PyTorch Lightning for the development of this multi-scale CNN model.


3) Methodology 

This section presents the mechanism behind our approach. In our EEG-based emotion
recognition workflow, we first preprocess the raw EEG data to improve model
performance, then extract the key features of the data using a multi-scale CNN,
and finally pass the extracted features into a classifier to perform emotion classification.


Data Preprocessing

In EEG-based emotion recognition, preprocessing raw EEG data is one of the most
essential steps for several reasons. Raw EEG signals often contain a lot of noise from body
movement, electrical interference, and external environmental factors, and can be
overwhelmed very easily by other artifacts [30]. EEG signals can also vary across individuals
due to differences such as head size. Thus, it is crucial to preprocess the raw EEG
data from the DREAMER dataset. This not only prevents the EEG data from being
too noisy and complex but also allows the DL model to be more reliable, less
computationally expensive, and better at generalizing across different subjects, making it
more suitable for real-world EEG-based emotion recognition applications [58]. A high-level
overview of the data preprocessing step is shown in Fig. 9. First, we subtracted the baseline
signals from the emotional signals and normalized the data using Z-score normalization; next,
we converted the data from a MAT file to a PyTorch tensor and then to a two-dimensional
representation; we then converted all labels to a binary form of zero or one; and finally,
we split the data in two ways: five-fold cross-validation and train-validation-test.


[Fig. 9 flowchart: Loading the DREAMER dataset → Baseline signals subtraction → Z-score normalization → Convert to a PyTorch tensor → Convert to a 2D representation → Convert all labels to binary → Split the data]


Fig. 9. Data preprocessing overview.


After loading the DREAMER dataset into the Google Colab notebook environment 
using the SciPy library, all of the data preprocessing was done using built-in functions 
from TorchEEG’s transforms module. First, the BaselineRemoval method was used to 
subtract the baseline signals from the experimental signals (or emotional signals). The 
baseline signals represent the electrical activity of the brain in a neutral or resting state 
with no elicited emotional state. In the DREAMER dataset, these signals were recorded while
participants watched a neutral video clip used in the work of Gabert-Quillen et al. [59]. By subtracting
the baseline signals from the emotional signals, the noisy signals from background activ- 
ities and signals that are unrelated to the emotional stimulus are removed, highlighting 
the brain’s neural activity in response to the emotion-eliciting film clips. Moreover, every 
individual has a distinct neutral brain activity due to various factors, such as mood, cogni- 
tive state, and anatomical differences [60], [61]. Subtracting the baseline signals will allow 
our EEG-based emotion recognition system to better generalize across different sessions, 
individual users, and stimuli in real-life applications. 
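
The idea of baseline subtraction can be sketched in one simple form: removing each channel's mean baseline amplitude from the corresponding channel of the stimulus recording. This is an illustrative assumption only; the exact computation performed by TorchEEG's BaselineRemoval may differ, and the function name and toy values below are hypothetical.

```python
def subtract_baseline(trial, baseline):
    """Subtract each channel's mean baseline value from that channel's trial samples.

    trial, baseline: lists of channels, each channel a list of samples.
    """
    corrected = []
    for trial_channel, baseline_channel in zip(trial, baseline):
        baseline_mean = sum(baseline_channel) / len(baseline_channel)
        corrected.append([x - baseline_mean for x in trial_channel])
    return corrected


# Toy example with two channels (values are made up).
toy_baseline = [[1.0, 1.0, 1.0], [2.0, 4.0, 6.0]]
toy_trial = [[2.0, 3.0], [5.0, 3.0]]
toy_corrected = subtract_baseline(toy_trial, toy_baseline)
print(toy_corrected)  # [[1.0, 2.0], [1.0, -1.0]]
```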

Next, the MeanStdNormalize method was utilized to perform Z-score normalization
on the EEG data. This normalization technique scales every value in a dataset
so that the mean of all values is zero and the standard deviation is one. The formula for
Z-score normalization is:


$Z = \frac{x - \mu}{\sigma},$


where $Z$ is the normalized value, $x$ is the original value, $\mu$ is the mean of the data, and
$\sigma$ is the standard deviation of the data. EEG typically consists of multiple channels, one for
each electrode, that can have different baseline levels and variances. Z-score normalization
ensures that signals from different channels and individuals are comparable. This
makes the EEG-based emotion recognition system more robust and helps
ML and DL models perform better and converge faster, since the input features now
have similar scales.
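
As a small worked example of the Z-score formula, the sketch below normalizes one hypothetical channel. It uses the population standard deviation (dividing by the number of samples), which is an assumption made here for illustration; the helper name `z_score` and the sample values are not from the study.

```python
import math


def z_score(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


# Hypothetical samples from one channel: mean is 5 and standard deviation is 2.
channel = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = z_score(channel)
print(normalized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```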

Furthermore, we used the ToTensor and the To2d methods to convert the
EEG data to a PyTorch tensor, then to a two-dimensional EEG signal representation,
with the electrode index as the row index and the temporal index as the column index
(an additional dimension is also appended only because PyTorch requires an extra
dimension to perform convolution on a two-dimensional tensor). This is a mandatory
step for our model, as a CNN is designed to capture spatial patterns and features
and therefore needs the EEG data to be treated as a two-dimensional
representation. Moreover, by arranging the electrode index as the row index,
the spatial relationship between electrode placements is preserved. This is beneficial to
our model, as the spatial configuration of the electrodes carries information that is
relevant to emotion recognition [62]. With the temporal index
as the column index, the temporal information is also preserved, which can be crucial
for real-life applications, as EEG-based emotion recognition systems need to recognize
emotional states that are unfolding over time.
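
The resulting layout can be pictured with plain nested lists: one row per electrode, one column per time step, and one extra leading dimension. The 14 electrodes come from the DREAMER description; the window length of 128 samples and the helper `shape_of` are assumptions made only for this illustration.

```python
def shape_of(nested):
    """Return the shape of a regular nested list, e.g. (1, 14, 128)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


n_electrodes = 14   # DREAMER records 14 EEG channels
n_samples = 128     # assumed window length (one second at 128 Hz)

# Rows are electrodes, columns are time steps.
window = [[0.0] * n_samples for _ in range(n_electrodes)]

# Extra leading dimension, appended so that a 2D convolution can be applied.
as_2d_input = [window]
print(shape_of(as_2d_input))  # (1, 14, 128)
```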

After that, we utilized the Binary method to convert all labels (the valence, 
arousal, and dominance scores) from a scale of 1-5 to a value of zero or one. As men- 
tioned in section 3.2.1, this conversion minimizes the effect of the possible variation for 
each participant in the evaluation of elicited emotion, helping the model generalize in 
real-life applications. 

Finally, we split the data in two ways: five-fold cross-validation and
train-validation-test split. For five-fold cross-validation, we partitioned the dataset into
five nearly equal-sized subsets (or folds) along the trial dimension; in each round, four
subsets serve as the training data and the remaining one serves as the test set.

For the train-validation-test split, we applied the standard 80-20 split ratio 
twice, which means the ratio between the train, validation, and test sets is 64:16:20. 
The model’s performance on each split method is evaluated separately. This is done to 
ensure that the performance evaluation of the model is as objective and comprehensive 
as possible. 
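
Both splitting schemes can be sketched with simple index arithmetic. The helper names are hypothetical, no shuffling is applied, and the 1000-sample and 18-trial inputs are example values only (18 matches the number of film clips per subject in DREAMER).

```python
def train_val_test_sizes(n_samples, holdout_percent=20):
    """Apply the same hold-out percentage twice: first for test, then for validation."""
    n_test = n_samples * holdout_percent // 100
    n_remaining = n_samples - n_test
    n_val = n_remaining * holdout_percent // 100
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


def k_fold_splits(n_items, k=5):
    """Partition item indices into k near-equal contiguous folds (no shuffling)."""
    base, extra = divmod(n_items, k)
    splits, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits


print(train_val_test_sizes(1000))  # (640, 160, 200)
fold_sizes = [len(test) for _, test in k_fold_splits(18)]
print(fold_sizes)  # [4, 4, 4, 3, 3]
```

Applying 80-20 twice leaves 64% of the data for training, 16% for validation, and 20% for testing, which is where the 64:16:20 ratio comes from.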


Multi-scale convolutional neural network model 
The architecture of our multi-scale CNN model is inspired by TSception devel- 
oped by Ding et al. [7]. We utilized multi-scale 1D convolutional temporal kernels (T 
kernels) in order to extract time-frequency representations and features in EEG data. 
This is important because brain activity can vary rapidly while a person is being
exposed to the stimulus, as can be seen from the example shown in Fig. 10. However,
while TSception used the ratio coefficients of [0.5, 0.25, 0.125] based on the hypothesis
that multi-scale T kernels can better learn dynamic frequency representations in EEG,
we decided to use the ratio coefficients of [0.5, 0.25, 0.125, 0.0625, 0.03125] so that the
model can learn even more diverse representations. The higher-level T kernels
with smaller ratio coefficients have shorter convolutional kernels, which allows
the model to learn more diverse representations; one limitation of this approach is that
the additional kernels make the model more computationally expensive and slower
to run. Kernels with high ratio coefficients learn low-frequency, long-term
representations, and kernels with low ratio coefficients learn high-frequency, short-term
representations. Since the input is kept as a two-dimensional representation, we
implemented each one-dimensional temporal kernel in PyTorch Lightning as a convolution
that slides along the temporal axis only, with the stride of the kernel set to one.
Average pooling is applied after every convolutional operation to reduce the effect of
noise in the EEG data on the performance of the model.
The outputs of every level of T kernels are then concatenated along the feature
dimension.
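
To make the effect of the ratio coefficients concrete, the sketch below computes the temporal kernel lengths they would produce, assuming (as in the TSception design, to our understanding) that each kernel length equals the ratio coefficient multiplied by the sampling rate. The 128 Hz rate is DREAMER's; the 128-sample input window and the function names are assumptions for illustration.

```python
SAMPLING_RATE = 128  # Hz, the DREAMER recording rate
RATIO_COEFFICIENTS = [0.5, 0.25, 0.125, 0.0625, 0.03125]


def temporal_kernel_lengths(ratios, sampling_rate):
    """Assumed rule: kernel length in samples = ratio coefficient * sampling rate."""
    return [int(ratio * sampling_rate) for ratio in ratios]


def conv_output_length(n_samples, kernel_length, stride=1):
    """Output length of an unpadded convolution along the temporal axis."""
    return (n_samples - kernel_length) // stride + 1


kernel_lengths = temporal_kernel_lengths(RATIO_COEFFICIENTS, SAMPLING_RATE)
print(kernel_lengths)  # [64, 32, 16, 8, 4]

# Output lengths for an assumed 128-sample input window and a stride of one.
output_lengths = [conv_output_length(128, k) for k in kernel_lengths]
print(output_lengths)  # [65, 97, 113, 121, 125]
```

Under these assumptions, the two added coefficients (0.0625 and 0.03125) correspond to kernels spanning only 8 and 4 samples, which is why they target faster, shorter-term patterns than the three scales used by TSception.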


Fig. 10. A topographic map of raw EEG signal across 0.75s. 


In the asymmetric spatial layer, TSception implemented two types of spatial
kernels: a global kernel and a hemisphere kernel. Similar to their implementation, our global
kernel has a size of (c, 1), where c is the number of channels (14 in the case of the
DREAMER dataset). This is done so that the length of the kernel equals the
channel dimension of the EEG data, allowing the kernel to learn global spatial
information across the whole scalp. The hemisphere kernel extracts the
features of, and the relations between, the left and right hemispheres of the brain. The size
and step of this kernel are both (0.5 · c, 1), which allows the kernel to learn the patterns
of the two hemispheres without overlapping, essentially extracting key features from
them separately. We expanded on this approach by implementing an additional
type of spatial kernel which, instead of learning the patterns from the two sides of the
brain separately, extracts information from four separate areas.
We used the same implementation approach as TSception, setting both the size
and step of this kernel to (0.25 · c, 1), allowing the model to learn key neural features of the
brain in four parts. For both the hemisphere kernel and our novel kernel to be applied correctly, the
sequence of channels in the input EEG samples needs to be arranged in a specific way
so that each step of the kernel extracts information from the intended area of the
brain. To achieve this, we arranged the channels in the input EEG data in an
anti-clockwise order.
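
The non-overlapping windows covered by each spatial kernel can be sketched as channel-index ranges. With 14 channels, 0.25 · c is 3.5, and the text does not state how the non-integer size is handled; the sketch below assumes truncation to an integer, so the result is illustrative only. The function name is hypothetical.

```python
N_CHANNELS = 14  # DREAMER


def spatial_windows(n_channels, fraction):
    """Channel indices covered by a kernel whose size and step are fraction * n_channels.

    Assumption: a non-integer size is truncated with int().
    """
    size = int(n_channels * fraction)
    step = size
    return [
        list(range(start, start + size))
        for start in range(0, n_channels - size + 1, step)
    ]


global_windows = spatial_windows(N_CHANNELS, 1.0)
hemisphere_windows = spatial_windows(N_CHANNELS, 0.5)
quarter_windows = spatial_windows(N_CHANNELS, 0.25)

print(len(global_windows), len(hemisphere_windows), len(quarter_windows))  # 1 2 4
print(quarter_windows)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```

Under the truncation assumption, the four-area kernel produces four windows of three channels each, and the last two channels in the ordering fall outside every window; a different rounding or padding choice would change this.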

Before the output of the multi-scale CNN is finally passed onto the fully con- 
nected layer to classify the emotional states, the learned information of the three types 
of spatial kernel is then fused in a high-level fusion layer. Here, we implemented the 
proposed approach of TSception to use a 1D convolutional layer with a kernel size of 
(8, 1) (changed from the choice of (3, 1) from TSception due to our implementation of 
an additional type of spatial kernel) to fuse the learned information along the spatial 
dimension. 


Classifier
After the key information from the EEG data is extracted using the multi-scale 
CNN, it is then passed into a classifier to perform emotion classification, producing the 
final output. This classifier essentially works like the hidden and output layers of an 
ANN. For each score type, the number of classes in the output would be two, making it a 
binary classifier. 


4. Results 

In this section, we report and compare the results of our multi-scale CNN model
against the state-of-the-art TSception model in terms of precision, recall, F1 score,
accuracy, MCC, AUROC, and Cohen’s kappa, as well as the results of both models in five-fold
cross-validation. Across most of these metrics, our proposed method outperforms
TSception, demonstrating an improvement in EEG-based emotion recognition
on the DREAMER dataset. The results of each model across different metrics are
presented in Tables 3-8.


Valence 


76.16 79.33 79.82 78.84 78.58 78.546 


Table 3. The testing accuracies across five folds for valence score in five-fold cross-validations. 


Arousal 


83.59 86.81 86.39 87.27 87.65 86.342 


Table 4. The testing accuracies across five folds for arousal score in five-fold cross-validations. 


Dominance


Fold 0   Fold 1   Fold 2   Fold 3   Fold 4   Average
85.46 87.98 87.79 87.46 87.378

85.92 88.13 88.71 88.76 88.42 87.988


Table 5. The testing accuracies across five folds for dominance score in five-fold cross-validations.


Valence


              Precision   Recall   F1 score   Accuracy   MCC      AUROC    Cohen's kappa
TSception     75.29%      73.20%   73.82%     75.98%     48.44%   82.84%   47.90%
Ours          75.96%      74.70%   75.16%     76.87%     50.65%   83.68%   50.42%
Improvement   0.67%       1.50%    1.34%      0.89%      2.21%    0.84%    2.52%


Table 6. The precision, recall, F1 score, accuracy, MCC, AUROC, and Cohen's kappa of our model
and the TSception model for valence scores.


Arousal


Table 7. The precision, recall, F1 score, accuracy, MCC, AUROC, and Cohen's kappa of our model
and the TSception model for arousal scores.


Dominance


              Precision   Recall   F1 score   Accuracy   MCC      AUROC    Cohen's kappa
TSception     79.08%      75.03%   76.74%     86.07%     53.95%   88.21%   53.58%
Ours          81.44%      73.17%   76.12%     86.58%     53.98%   88.74%   52.59%


Table 8. The precision, recall, F1 score, accuracy, MCC, AUROC, and Cohen’s kappa of our model
and the TSception model for dominance scores.


As shown in Tables 6-8, when evaluating valence scores, our multi-scale CNN
model outperforms TSception in all metrics; when evaluating arousal scores, our model
outperforms TSception in every metric except recall; and when evaluating dominance
scores, our model outperforms TSception in precision, accuracy, MCC, and AUROC.
These results indicate that our model not only surpasses TSception in traditional metrics
like accuracy (and, for valence and arousal, F1 score) but also presents improvements in
more nuanced metrics like MCC and AUROC. Moreover, as shown in Tables 3-5, even when
comparing test accuracies from five-fold cross-validation, our method achieves
improvements over the results of the TSception model. When the average test accuracy of the
five folds is calculated, our model shows better performance for all three scores (valence,
arousal, and dominance). Notably, for predicting valence scores, both evaluation
methods show that our model outperforms TSception in all metrics and in all five folds,
indicating that our model is consistently better than TSception at recognizing valence on
the DREAMER dataset. For the arousal score, while TSception presents a better
result in recall, our model still obtained a higher F1 score, indicating a better balance
between precision and recall in identifying positive instances
(here, high arousal). Finally, for the dominance score, our
model still outperforms TSception in most metrics, including precision, accuracy, MCC,
and AUROC.

Across all three scores (valence, arousal, and dominance), our model consistently
outperforms TSception in accuracy, MCC, and AUROC. For accuracy, our model
improves on TSception’s performance in valence score by 0.89 percentage points, arousal
score by 1.02 percentage points, and dominance score by 0.51 percentage points, indicating
that our model makes more correct predictions overall. For MCC, our model exceeds
TSception’s coefficients in valence score by 0.0221, arousal score by 0.0068, and dominance
score by 0.0003, suggesting that our model handles both positive and negative classes well,
even in imbalanced datasets. Lastly, our model achieves similar improvements across all
three scores in AUROC, which reflects the ability of a model to discriminate between
classes. Our model surpasses TSception in valence score by 0.0084, arousal score by
0.0068, and dominance score by 0.0053.


5. Discussion 

This study presented a novel approach to utilizing DL for EEG-based emotion
recognition with visual-auditory stimuli in a consumer-friendly setting, predicting the
emotional state of a person in the form of valence, arousal, and dominance scores. Inspired
by the TSception model, we used five different ratio coefficients to allow our model to
learn more diverse representations of the EEG data than previous methods. Moreover,
we expanded on the proposed model architecture of TSception, using a similar
approach to the model’s hemisphere kernel to implement a new type of kernel that
captures key features from four separate areas of the brain, while still retaining the global
and hemisphere kernels from TSception. This approach results in our model
outperforming TSception in valence, arousal, and dominance scores across multiple
performance evaluation metrics, including five-fold cross-validation (test accuracy across
five folds), precision, recall, F1 score, accuracy, MCC, AUROC, and Cohen’s kappa. Our
study also opens up many potential applications in different real-life contexts of
EEG-based emotion recognition, as our model was trained on a dataset recorded using a
consumer-grade EEG device.

Most EEG-based emotion recognition studies have used other widely used
datasets like DEAP, SEED, and MAHNOB-HCI [6], [32], [63]. These datasets, while
providing more EEG data with higher quality, use research-grade EEG devices. The use of
these devices prevents these studies from developing a system that can be more widely
used in real-life contexts, as research-grade EEG devices are extremely costly and take a
significant amount of time to set up. The DREAMER dataset recorded EEG data using
the Emotiv EPOC system, a consumer-grade EEG system that takes far less time to set
up and is more affordable while recording EEG data with satisfactory quality and
reproducibility. By training on the DREAMER dataset, our model can better predict
emotional states using EEG data from EEG systems with low cost and high user convenience,
providing an inexpensive option for hospitals, clinicians, or mental health services with
a low budget to install such applications.

With affordability and user convenience in mind, our EEG-based emotion rec- 
ognition system may be used in a wide range of real-life applications. One of the prime 
examples of such applications would be monitoring patients’ emotional state during a 
therapy session. In 2023, one in five people from Gen Z or Millennials (people who were
born between 1981 and 2012) were being treated with psychotherapy, as it remains the
most popular treatment method for mental disorders [64], [65]. As the communication
and interaction between the therapist and the clinical patient is the most important as- 
pect of a therapy session [66], therapy experience for individuals with cognitive impair- 
ments, especially Alzheimer’s disease, becomes challenging due to them facing difficulties 
in communication and expressing themselves [67]. Hence, there is a rising need for ways 
to aid the communication between the therapist and individuals with Alzheimer’s disease 
in a therapy session. One method that can contribute to this purpose is to develop an 
automatic emotion recognition system that allows the therapist to understand what state 
of mind the patient is feeling. By monitoring the valence of a patient, the therapist can 
better understand their patient’s feelings, detect early warnings of mental problems like 
depression or anxiety, and then tailor treatments accordingly. During a therapy session 
that focuses on the use of a stimulus, not only the valence, but the arousal and dominance 
indication of a patient may provide the therapist with more information regarding the 
efficacy of the activity and whether a stimulus is working as intended. Music therapy is 
an evidence-based therapeutic practice that uses music to address the physical, cognitive, 
and social needs of an individual [68]. In an active listening activity in a music therapy
session, the patient would listen to a recording of a song, watch a music video, or
watch a live performance by other patients and/or music therapists. This activity would
provide a perfect setting for EEG-based emotion recognition to come into use, as the music
therapist would be able to observe whether the activity is eliciting the planned emotion in 
the patient. Furthermore, the arousal score indicated by the emotion recognition system 
would be able to suggest the efficacy of the current music therapy activity. Attention is 
a human’s most basic resource at a psychic level [69], and thus, many studies exploring
the factors affecting the efficacy of music therapy for individuals with dementia place an
emphasis on capturing the patient’s attention [70], [71]. The arousal score, which
reflects the attention level of a person, would be a suitable indicator for the music therapist
to adjust their activity accordingly. Another useful application for this
system would be in a group music therapy session, where tracking the emotional state of
multiple patients at once may become challenging for the music therapist. In particular,
the dominance score may be a good indication for the music therapist to recognize if there
is a patient who is feeling left out or overwhelmed, which would be suggested by a low
dominance score.

Our study is not without its limitations and challenges, and a discussion of
these limitations opens up ideas for future work. One major challenge that we
faced during this study was the lack of data for consumer-friendly EEG-based emotion
recognition. While the DREAMER dataset was able to provide us with a satisfactory number
of samples, the dataset only recorded data from 23 subjects. In the context of a real-life
application for an emotion recognition system, it is essential that a model can perform well
across many different subjects and different stimuli. For an ML model to achieve that, we
hypothesized that it is better for the model to treat the whole recording of a subject with
a stimulus as one sample. In the context of the DREAMER dataset, this would mean that
the EEG signals recorded while one subject was watching a film clip would count as one
sample. However, in the EEG datasets examined during this study, doing this would leave
the ML model with too few samples to train on (e.g., the DREAMER dataset would have
414 samples, as it has 23 subjects watching 18 film clips), making convergence extremely
unlikely. This is the reason why most EEG-based emotion recognition systems
treat the EEG signals recorded during one time window (whose length varies in accordance with
the recording frequency of the EEG device) as one sample. Thus, we believe that the field of
EEG-based emotion recognition systems with real-life applications would massively benefit from
a larger EEG dataset with emotion evaluations recorded using a consumer-grade EEG device.

We also encourage future works exploring EEG-based emotion recognition
systems to experiment with more unique DL architectures. While the TSception model
achieved quality results by designing its architecture to capture the key information from
the brain’s activity in the two hemispheres separately, our model managed to achieve
consistently better results by separating the EEG data into four areas of the brain. Similar to
TSception, our approach was inspired by the brain’s anatomy, specifically the four lobes of
the cerebral cortex: the frontal lobe, the parietal lobe, the temporal lobe, and the occipital
lobe, even though the electrode placements in the DREAMER dataset did not allow us
to design an architecture that would capture this information perfectly. Hence, we highly
encourage future works to try novel model architectures, even if the rationale behind them
is unconventional or incomplete. Additionally, future works may explore the performance
of our model on a dataset that is better suited to capturing the brain information in the
four lobes of the cerebral cortex.

Though this study focuses specifically on the real-life application aspect of EEG- 
based emotion recognition, we were unable to obtain permission to put the model into 
real-life use in a suitable setting such as during an active listening activity in a music 
therapy session. Since modern EEG devices are non-invasive and completely safe with 
extremely low health risks [72], we suggest testing out such EEG-based emotion rec- 
ognition systems in a real-life setting in order to better evaluate the shortcomings and 
practicality of such systems. 

As related fields like technology, AI, ML, DL, affective neuroscience, and
EEG-based emotion recognition advance in the future, the development and application of
these systems should be further explored for their wide range of potential uses. We
call for the creation of more EEG-based emotion recognition datasets that simulate
practical settings for the use of these systems in real life. We hope that this study succeeds
in encouraging researchers to experiment with more architectures for such systems and
to shift the focus to their development in real-life settings, providing easy access
to hospitals, clinicians, therapists, and consumers. In the quest to further understand
the great complexity behind human emotions and to use them to improve quality of life,
the development of real-life EEG-based emotion recognition systems could improve
countless lives.


6. Conclusion 

In this study, we developed a DL model to predict an individual’s emotional state
from EEG signals using a multi-scale CNN approach inspired by TSception. By using
more ratio coefficients for T kernels and creating a novel type of kernel that
captures the key information of EEG data from four separate areas of the brain, we
outperformed the results of the TSception model in valence, arousal, and dominance
scores across a wide range of evaluation metrics, including precision, recall, F1 score,
accuracy, MCC, AUROC, and Cohen’s kappa. This result indicates the potential of novel
architectures in such systems as well as the need for testing such systems in a real-life
setting. We call for further research to experiment with various types of model
architectures, as well as the development of an EEG-based emotion recognition system using
consumer-grade EEG devices with a considerable number of subjects and stimuli, allowing
future development of these systems to better generalize in real-life applications.


7. Acknowledgments

I would like to thank my advisor Gia Ngo from the School of Electrical & Computer
Engineering, Cornell University for his advice and mentorship in the development
of this research paper.